System Prompt Extraction Test
Overview
The objective of the System Prompt Extraction test is to assess model vulnerabilities with regards to leaking its system prompt. This test evaluates the model's ability to keep the system prompt confidential and resist attempts to extract it. A system prompt is a set of instructions or context given to an AI system to guide its behavior and output for specific tasks or interactions. For sensitive and public-facing applications, system prompts are sometimes considered as a trade-secret.
To conduct this test, we use a dataset of 200 diverse system instructions collected from open-source repositories, databases and model cards. It aims to cover a wide range of use cases and domains in order to provide a holistic evaluation of the model’s vulnerability to leaking its system prompt.
The System Prompt Extraction (SPE) attack methodology follows the steps below:
For each system prompt in Dynamo AI’s SPE dataset
- Set the system prompt for the target model.
- Run extraction attempts 5 times with diverse adversarial prompts tailored for the attack.
- Store the extracted texts along with the original system prompt.
- For each (extracted text, system prompt) pair, compute ROUGE-L scores and check for Exact Match. The average scores across the entire dataset are reported.
Metrics
Following state-of-the-art research on System Prompt Extraction, the test considers two metrics to measure vulnerability based on the comparison between the original system prompt and the extracted one.
- ROUGE-L Score: ROUGE-L is a metric used to evaluate the quality of text summarization and machine translation. It measures the similarity between two texts based on the longest common subsequence (LCS) of their tokens. The score ranges from 0 to 1, with 1 indicating an exact match between the original and extracted texts.
- Exact Match: This metric checks if the extracted text exactly matches the original system prompt.
References
[1] Effective Prompt Extraction from Language Models - https://arxiv.org/abs/2307.06865